The course focuses on important statistical techniques used in research. It is hands-on: I get to learn how to use various tools such as R and GitHub. The main theme is open science, the new paradigm being pushed in today’s research world.
This is the link to my repository
Describe the work you have done this week and summarize your learning.
This week focuses on data wrangling, regression analysis and model validation. In a survey, students were asked questions to understand the approaches they use in learning. Based on these, the following variables were extracted: gender, age, examination points, strategic approach, deep learning approach, attitude and surface learning approach.
My analysis showed that there were about twice as many female respondents as male. Overall, attitude seems to be the factor that most determines students’ success in their examination. The strategic learning approach also seems to contribute to some extent. While age showed a very small negative effect, it cannot be concluded to be a major factor, as most of the respondents were young students below 26.
To validate my model, I used the Q-Q plot, the residuals vs fitted values plot and the residuals vs leverage plot. In choosing my predictor variables, I examined the standard error, the multiple R-squared (coefficient of determination) and the p-value to check the significance of the variables. The ggpairs() function also came in handy for getting a broad overview of the distributions of the variables.
#read the students2014 data
lrn14<- read.table("C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-project/data/learning2014.txt", header=T, sep = "\t")
#Alternatively, I could also read the already wrangled data as below:
#lrn14<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=T, sep = ",")
#structure of the data
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ deep : num 3.58 2.92 3.5 3.5 3.67 4.75 3.83 3.25 4.33 4 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 2.42 1.92 2.83 2.17 3 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 3.62 2.25 4 4.25 3.5 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
#dimension of the data
dim(lrn14)
## [1] 166 7
The data includes 7 variables which describe several attributes of the students.
#load packages
library(ggplot2)
#install.packages("GGally")
library(GGally)
#create a pair plot to see the broad overview of the data
ggpairs(lrn14, aes(col=gender), lower = list(combo = wrap("facethist", bins = 20)))
#summary of the data
summary(lrn14)
## deep surf stra age
## Min. :1.58 Min. :1.580 Min. :1.250 Min. :17.00
## 1st Qu.:3.33 1st Qu.:2.420 1st Qu.:2.620 1st Qu.:21.00
## Median :3.67 Median :2.830 Median :3.185 Median :22.00
## Mean :3.68 Mean :2.787 Mean :3.121 Mean :25.51
## 3rd Qu.:4.08 3rd Qu.:3.170 3rd Qu.:3.620 3rd Qu.:27.00
## Max. :4.92 Max. :4.330 Max. :5.000 Max. :55.00
## attitude points gender
## Min. :1.400 Min. : 7.00 F:110
## 1st Qu.:2.600 1st Qu.:19.00 M: 56
## Median :3.200 Median :23.00
## Mean :3.143 Mean :22.72
## 3rd Qu.:3.700 3rd Qu.:27.75
## Max. :5.000 Max. :33.00
By and large, out of the 166 students (after excluding those with examination points of 0), we have almost twice as many female respondents as male. The average age is about 26 years, with the youngest being 17 and the oldest 55. The age distribution is skewed to the right, showing that the respondents are mostly young students below 26 years; the box plot also shows that age is non-normal, with some outliers.
The attitude, points and deep learning approach variables also have a few outliers. Generally, the pairs show weak correlations. However, attitude seems to be the most strongly correlated with points (although still weakly), for both males and females. Men generally seem to have a slightly better attitude than their female counterparts, although some male students show a very poor attitude towards learning.
Both genders also appear to range from an average to a very deep learning approach. Female students have a slightly better strategic approach. However, there appears to be little distinction between the examination points of male and female students. Most students scored above average, while a few performed very poorly.
l_model<-lm(points~gender+age+attitude+deep+stra+surf, data = lrn14)
summary(l_model)
##
## Call:
## lm(formula = points ~ gender + age + attitude + deep + stra +
## surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4513 -3.2513 0.3836 3.5240 11.4869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.32352 5.25571 3.486 0.000633 ***
## genderM -0.06009 0.92983 -0.065 0.948551
## age -0.09606 0.05396 -1.780 0.076916 .
## attitude 3.44426 0.59775 5.762 4.19e-08 ***
## deep -1.04826 0.78408 -1.337 0.183155
## stra 0.95823 0.55156 1.737 0.084271 .
## surf -1.11001 0.84433 -1.315 0.190519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared: 0.2312, Adjusted R-squared: 0.2022
## F-statistic: 7.971 on 6 and 159 DF, p-value: 1.588e-07
I first fitted the model with all the variables, then used stepwise regression to help remove the redundant ones.
library(MASS) #load package
stepw_reg<-stepAIC(l_model, direction = "both")
## Start: AIC=558.34
## points ~ gender + age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - gender 1 0.12 4408.0 556.35
## - surf 1 47.91 4455.8 558.14
## - deep 1 49.55 4457.5 558.20
## <none> 4407.9 558.34
## - stra 1 83.67 4491.6 559.46
## - age 1 87.88 4495.8 559.62
## - attitude 1 920.43 5328.3 587.82
##
## Step: AIC=556.35
## points ~ age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 47.82 4455.8 556.14
## - deep 1 49.66 4457.7 556.21
## <none> 4408.0 556.35
## - stra 1 88.28 4496.3 557.64
## - age 1 90.18 4498.2 557.71
## + gender 1 0.12 4407.9 558.34
## - attitude 1 999.58 5407.6 588.27
##
## Step: AIC=556.14
## points ~ age + attitude + deep + stra
##
## Df Sum of Sq RSS AIC
## - deep 1 27.03 4482.9 555.14
## <none> 4455.8 556.14
## + surf 1 47.82 4408.0 556.35
## - age 1 75.38 4531.2 556.92
## - stra 1 106.09 4561.9 558.04
## + gender 1 0.02 4455.8 558.14
## - attitude 1 1085.20 5541.0 590.32
##
## Step: AIC=555.14
## points ~ age + attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4482.9 555.14
## - age 1 76.60 4559.5 555.95
## + deep 1 27.03 4455.8 556.14
## + surf 1 25.19 4457.7 556.21
## - stra 1 97.54 4580.4 556.71
## + gender 1 0.01 4482.9 557.14
## - attitude 1 1061.20 5544.1 588.41
After the stepwise regression, I ended up with three explanatory variables: ‘age’, ‘attitude’ and ‘stra’.
Now, let’s refit the model with these three variables.
l_model2<-lm(points ~ age + attitude + stra, data = lrn14)
summary(l_model2)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1199 -3.2006 0.3307 3.4131 10.7605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89473 2.64895 4.113 6.20e-05 ***
## age -0.08821 0.05302 -1.664 0.0981 .
## attitude 3.48143 0.56219 6.193 4.69e-09 ***
## stra 1.00324 0.53436 1.877 0.0623 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.072e-08
As shown by the model, the p-value of ‘attitude’ is well below 0.05, so we can be confident that ‘attitude’ significantly affects the examination points. However, the coefficient of determination (multiple R-squared) of the model is only 0.2182 (21.82%), which is relatively low. The standard error of ‘age’ is quite small, but ‘age’ itself is insignificant. Just as points and age showed a weak negative correlation earlier, age has a slight negative effect (-0.08821) on the examination points, while attitude, which showed a somewhat stronger positive correlation with points earlier, has a positive effect (3.48143) on the examination points.
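To make the coefficients concrete, the fitted model can be used to predict exam points for a hypothetical student. This is only an illustrative sketch: the student profile below (age 22, attitude 4.0, strategic approach 3.0) is made up, and it uses the l_model2 object fitted above.

```r
# Illustrative only: predict exam points for a hypothetical student
new_student <- data.frame(age = 22, attitude = 4.0, stra = 3.0)
predict(l_model2, newdata = new_student)
# roughly 10.89 - 0.088*22 + 3.48*4.0 + 1.00*3.0, i.e. about 25.9 points
```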
I will remove age and the strategic approach because they are insignificant, with p-values above 0.05.
Finally, I am left with a single predictor, ‘attitude’, which has a positive effect on the exam points.
#Final model
lm_final<-lm(points~attitude, data = lrn14)
summary(lm_final)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The coefficient of determination (multiple R-squared) decreased only slightly, from 21.82% to 19.06%, and the residual standard error increased only from 5.26 to 5.32; the difference is not considerable, and the p-value remains well below 0.05.
Thus, I decided to use ‘attitude’ alone. This is in accordance with the principle of parsimony, whereby it is advisable to use as few predictors as possible when building models.
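As a cross-check on this choice, an F-test can compare the two nested models directly. This is a sketch using the lm_final and l_model2 objects fitted above; a small p-value would argue for keeping ‘age’ and ‘stra’ after all.

```r
# F-test comparing the attitude-only model with the three-predictor model
anova(lm_final, l_model2)
```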
par(mfrow=c(2,2))
plot(lm_final, which = c(1,2,5))
As seen in the normal Q-Q plot, the points fit the line well, aside from a few deviations at the tails. The errors can thus be said to be approximately normally distributed.
In the residuals vs fitted plot, there is only a slight change in the spread of the residuals as the fitted values increase, and in the residuals vs leverage plot no observation seems to have clear leverage over the others.
The data describes students’ secondary-education achievement in two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. An important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final-year grade (issued at the 3rd period) while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for more details). This is the data source
The data was obtained from this source
# access the tidyverse libraries tidyr, dplyr, ggplot2
#install.packages("tidyr")
library(tidyr); library(dplyr); library(ggplot2); library(GGally);
#read the data
alc<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = T)
colnames(alc) #names of columns
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
summary(alc) #summary
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.442
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.034 Mean :0.2906
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.00 Min. :1.000 Min. :1.000
## yes:364 yes:121 1st Qu.:4.00 1st Qu.:3.000 1st Qu.:2.000
## Median :4.00 Median :3.000 Median :3.000
## Mean :3.94 Mean :3.223 Mean :3.113
## 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.00 Max. :5.000 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.00 Min. :1.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:3.000 1st Qu.: 0.000
## Median :1.000 Median :2.00 Median :4.000 Median : 3.000
## Mean :1.474 Mean :2.28 Mean :3.579 Mean : 5.319
## 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000 3rd Qu.: 8.000
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :75.000
## G1 G2 G3 alc_use
## Min. : 3.00 Min. : 0.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 8.00 1st Qu.: 8.25 1st Qu.: 8.00 1st Qu.:1.000
## Median :10.50 Median :11.00 Median :11.00 Median :1.500
## Mean :10.86 Mean :10.71 Mean :10.39 Mean :1.877
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :19.00 Max. :19.00 Max. :20.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:270
## TRUE :112
##
##
##
dim(alc) #dimension
## [1] 382 35
str(alc) #structure
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
attach(alc) #attach all the variables in the dataframe
# define a new logical column 'high_use'
alc$high_use<- alc$alc_use > 2
#alc <- mutate(alc, high_use = alc_use > 2) #alternative
hyp<- alc[,c("age", "sex", "absences","freetime","alc_use")]
#exploring my chosen variables
# draw a bar plot of each variable
hyp<-gather(hyp) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
hyp + geom_bar()
The above shows the distribution of the chosen predictors and also the scale of alcohol consumption. The ages of the students mainly fall within 15–19, with a few students aged 20 and 22. Generally, there are relatively few heavy alcohol consumers amongst the students. The students seem to have quite a lot of free time. The respondents include 198 female and 184 male students.
hyp<- alc[,c("age", "sex", "absences","freetime","alc_use")]
#show the overview of the predictors and response variable(non-binomial)
plot_hyp <- ggpairs(hyp, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_hyp #draw the plot
Here, we can see that the female students seem to be slightly older in general, and both the male and female groups include students much older than the rest of the sample. Absences are not very high overall, but there are some chronic absentees among both male and female students. Most of the students have plenty of free time, while a few are very busy. By and large, the predictors show no correlation with each other, so multicollinearity is not a problem here. However, the predictors also show only very weak correlations with alcohol consumption. Nevertheless, it should be recalled that correlation is not causation; hence, I will dig further to understand the effects of these predictors on alcohol consumption amongst students and see whether they are significant.
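The pairwise correlations from the ggpairs plot can also be checked numerically. A quick sketch, using only the numeric columns of the alc data frame loaded above:

```r
# numeric correlation matrix of the chosen predictors and alcohol use
round(cor(alc[, c("age", "absences", "freetime", "alc_use")]), 2)
```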
Below are crosstabs that compare the predictors with alcohol consumption.
#using the binomial response(high_use)
#it is also possible to use simple crosstabs e.g
#xtabs(high_use~age, data = alc)
#but below is a more comprehensive crosstab.
#Crosstabs
summarise(group_by(alc, age,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 12 x 4
## # Groups: age [?]
## age high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 15 FALSE 63 11.380952
## 2 15 TRUE 18 11.055556
## 3 16 FALSE 79 11.303797
## 4 16 TRUE 28 10.714286
## 5 17 FALSE 65 10.123077
## 6 17 TRUE 35 10.228571
## 7 18 FALSE 55 9.218182
## 8 18 TRUE 26 9.423077
## 9 19 FALSE 7 5.714286
## 10 19 TRUE 4 6.250000
## 11 20 FALSE 1 18.000000
## 12 22 TRUE 1 8.000000
#an alternative and preferrable way of doing the above
#alc %>% group_by(age, high_use) %>% summarise(count=n(), mean_grade = mean(G3))
summarise(group_by(alc, sex,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 157 9.687898
## 2 F TRUE 41 10.414634
## 3 M FALSE 113 11.610619
## 4 M TRUE 71 9.971831
summarise(group_by(alc, absences,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 48 x 4
## # Groups: absences [?]
## absences high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 0 FALSE 94 8.372340
## 2 0 TRUE 22 8.318182
## 3 1 FALSE 2 12.000000
## 4 1 TRUE 1 15.000000
## 5 2 FALSE 54 12.296296
## 6 2 TRUE 13 11.076923
## 7 3 FALSE 2 11.000000
## 8 3 TRUE 4 13.000000
## 9 4 FALSE 36 11.666667
## 10 4 TRUE 15 10.200000
## # ... with 38 more rows
summarise(group_by(alc, freetime,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 10 x 4
## # Groups: freetime [?]
## freetime high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 16 9.12500
## 2 1 TRUE 2 8.50000
## 3 2 FALSE 47 11.87234
## 4 2 TRUE 15 11.60000
## 5 3 FALSE 116 9.75000
## 6 3 TRUE 40 9.82500
## 7 4 FALSE 69 10.46377
## 8 4 TRUE 40 10.10000
## 9 5 FALSE 22 12.54545
## 10 5 TRUE 15 9.80000
# initialize a plot of 'age'
sex_age <- ggplot(data = alc, aes(x=age, col=sex))
# draw a bar plot of age by sex
sex_age + geom_bar() + facet_wrap("sex")
# initialize a plot of 'high_use'
hu_sex <- ggplot(data = alc, aes(x=sex, col=sex))
# draw a bar plot of high_use by sex
hu_sex + geom_bar() + facet_wrap("high_use")
Here, we can see that more female students than male students are not high consumers of alcohol; men generally consume more alcohol than their female counterparts.
# initialize a plot of 'high_use'
hu_ft <- ggplot(data = alc, aes(x=freetime, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_ft + geom_bar() + facet_wrap("high_use")
Most students with more free time do not seem to be high consumers of alcohol, contrary to what one might expect.
# initialize a plot of 'high_use'
hu_ab <- ggplot(data = alc, aes(x=absences, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_ab + geom_bar() + facet_wrap("high_use")
#few of the absentees are high alcohol consumers, but many are not
# initialize a plot of 'high_use'
hu_ag <- ggplot(data = alc, aes(x=age, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_ag + geom_bar() + facet_wrap("high_use")
Some of the young students below 20 years are high consumers of alcohol, but they are fewer than those who are not high consumers.
# initialize a plot of 'high_use'
hu_rom <- ggplot(data = alc, aes(x=romantic, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_rom + geom_bar() + facet_wrap("high_use")
Students in romantic relationships generally include fewer high consumers of alcohol; the same holds for those not in romantic relationships.
Below are boxplots to give a clearer picture, as explained earlier.
Relationship of alcohol consumption with absences:
# initialise a plot of high_use and absences
h_ab<- ggplot(alc, aes(x=high_use, y=absences,col=sex))
# define the plot as a boxplot and draw it
h_ab + geom_boxplot() + ggtitle("Student absences vs alcohol consumption")
There are a few chronic absentees amongst male and female students, but generally, most of the students have low absences.
# initialise a plot of high_use and age
h_ag<- ggplot(alc, aes(x=high_use, y=age,col=sex))
# define the plot as a boxplot and draw it
h_ag + geom_boxplot() + ggtitle("Students' age vs alcohol consumption")
# initialise a plot of high_use and freetime
h_fr<- ggplot(alc, aes(x=high_use, y=freetime,col=sex))
# define the plot as a boxplot and draw it
h_fr + geom_boxplot() + ggtitle("Student's freetime vs alcohol consumption")
# initialise a plot of high_use and sex
alc_sex<- ggplot(alc, aes(y=alc_use, x=sex,col=sex))
# define the plot as a boxplot and draw it
alc_sex + geom_boxplot() + ggtitle("Student's alcohol consumption by sex")
There are a few female students with quite high alcohol consumption, but consumption levels vary less amongst the female students than amongst the male students.
high_use_mod1<- glm(high_use~ age + sex + absences + freetime , data= alc, family = "binomial")
summary(high_use_mod1)
##
## Call:
## glm(formula = high_use ~ age + sex + absences + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5188 -0.8214 -0.5957 1.0480 2.1082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.35260 1.76384 -3.035 0.002408 **
## age 0.15609 0.10221 1.527 0.126698
## sexM 0.89550 0.24773 3.615 0.000301 ***
## absences 0.07162 0.01802 3.974 7.08e-05 ***
## freetime 0.30118 0.12578 2.394 0.016644 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.53 on 377 degrees of freedom
## AIC: 427.53
##
## Number of Fisher Scoring iterations: 4
# coef(high_use_mod1)
# coefficients(high_use_mod1)
The predictors “age” and “freetime” are not significant. Hence, these could be considered redundant variables without a considerable effect on alcohol consumption, compared with “absences” and “sex”, which are significant. I therefore fail to reject the null hypothesis that age and freetime are not related to alcohol consumption. On the other hand, I reject the null hypothesis that male sex and absences are not related to alcohol consumption; both have positive effects on it.
high_use_mod2<- glm(high_use~ sex + absences, data= alc, family = "binomial")
summary(high_use_mod2)
##
## Call:
## glm(formula = high_use ~ sex + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7368 -0.8501 -0.5838 1.0919 1.9899
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.83117 0.21956 -8.340 < 2e-16 ***
## sexM 0.99923 0.24179 4.133 3.59e-05 ***
## absences 0.07403 0.01811 4.089 4.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 425.79 on 379 degrees of freedom
## AIC: 431.79
##
## Number of Fisher Scoring iterations: 4
#options("contrasts")
Check the overall effect of the removed variables by performing a likelihood ratio test.
anova(high_use_mod1, high_use_mod2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ age + sex + absences + freetime
## Model 2: high_use ~ sex + absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 377 417.53
## 2 379 425.79 -2 -8.2563 0.01611 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test gives p = 0.016, so dropping ‘age’ and ‘freetime’ does worsen the fit somewhat at the 5% level, though not at the 1% level.
For the sake of a simpler model, I proceed with ‘sex’ and ‘absences’, which remain clearly significant on their own.
Let’s also see whether “sex” is significant in the prediction.
high_use_mod3<- glm(high_use~ absences, data= alc, family = "binomial")
anova(high_use_mod2, high_use_mod3, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ sex + absences
## Model 2: high_use ~ absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 379 425.79
## 2 380 443.69 -1 -17.907 2.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test shows that sex is important and would remain in the model.
Hence, the final model is: high_use ~ sex + absences
# compute odds ratios (odds_ra)
odds_ra <- exp(coef(high_use_mod2))
#odds_ra <- coef(high_use_mod2) %>% exp #alternaive
# compute confidence intervals (conf_int)
conf_int <- exp(confint(high_use_mod2))
#Conf_Int <- high_use_mod2 %>% confint() %>% exp #alternative
# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
## odds_ra 2.5 % 97.5 %
## (Intercept) 0.1602264 0.1022869 0.2424271
## sexM 2.7161808 1.7015022 4.3985356
## absences 1.0768398 1.0414538 1.1176941
Here, we can see that male sex and absences are both positively associated with high alcohol use, since their odds ratios are above 1: being male multiplies the odds of high use by about 2.7, while each additional absence multiplies them by about 1.08. The confidence interval for male sex is quite wide compared with that of “absences”, which is much narrower.
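To make the odds ratios concrete, the fitted coefficients can be turned into a predicted probability for an illustrative case. This is a sketch: the student profile is made up, and it plugs in the printed coefficients of high_use_mod2 above.

```r
# predicted probability of high use for a male student with 10 absences,
# using the coefficients of high_use_mod2
log_odds <- -1.83117 + 0.99923 + 0.07403 * 10
plogis(log_odds)  # inverse logit; about 0.48
```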
# predict() the probability of high_use
probs<- predict(high_use_mod2, type = "response")
Add the predicted probabilities to ‘alc’.
alc$prob_high_use <- probs
#alc <- mutate(alc, probability = probabilities)
Use the probabilities to make a prediction of high_use, setting 0.5 as the threshold.
alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, prediction = prob_high_use>0.5)
See the first ten and last ten observations: the original classes, predicted probabilities, and class predictions.
head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
## failures absences sex high_use prob_high_use predict_high_use
## 1 0 6 F FALSE 0.1998897 FALSE
## 2 0 4 F FALSE 0.1772568 FALSE
## 3 3 10 F TRUE 0.2514562 FALSE
## 4 0 2 F FALSE 0.1566846 FALSE
## 5 0 4 F FALSE 0.1772568 FALSE
## 6 0 10 M FALSE 0.4771074 FALSE
## 7 0 0 M FALSE 0.3032349 FALSE
## 8 0 6 F FALSE 0.1998897 FALSE
## 9 0 0 M FALSE 0.3032349 FALSE
## 10 0 0 M FALSE 0.3032349 FALSE
select(alc, failures, absences, sex, high_use, prob_high_use, predict_high_use) %>% tail(10)
## failures absences sex high_use prob_high_use predict_high_use
## 373 1 0 M FALSE 0.3032349 FALSE
## 374 1 14 M TRUE 0.5509447 TRUE
## 375 0 2 F FALSE 0.1566846 FALSE
## 376 0 7 F FALSE 0.2119931 FALSE
## 377 1 0 F FALSE 0.1380993 FALSE
## 378 0 0 F FALSE 0.1380993 FALSE
## 379 1 0 F FALSE 0.1380993 FALSE
## 380 1 0 F FALSE 0.1380993 FALSE
## 381 0 3 M TRUE 0.3520937 FALSE
## 382 0 0 M TRUE 0.3032349 FALSE
table(high_use = alc$high_use, prediction = alc$predict_high_use)
## prediction
## high_use FALSE TRUE
## FALSE 263 7
## TRUE 89 23
The model correctly predicted 263 cases of non-high use and 23 cases of high use of alcohol. It wrongly classified 89 actual high users as non-high users, and 7 non-high users as high users.
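The counts in this table can be summarized with standard classification metrics. A sketch, rebuilding the confusion matrix from the alc columns defined above:

```r
# accuracy, sensitivity and specificity from the confusion matrix
conf <- table(high_use = alc$high_use, prediction = alc$predict_high_use)
sum(diag(conf)) / sum(conf)                   # accuracy: (263 + 23) / 382, about 0.75
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])    # sensitivity: 23 / 112, about 0.21
conf["FALSE", "FALSE"] / sum(conf["FALSE", ]) # specificity: 263 / 270, about 0.97
```

The low sensitivity shows that the model misses most of the true high users, even though its overall accuracy looks reasonable.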
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))
# define the geom as points and draw the plot
g + geom_point()
#The wrong predictions were quite numerous
conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68848168 0.01832461 0.70680628
## TRUE 0.23298429 0.06020942 0.29319372
## Sum 0.92146597 0.07853403 1.00000000
#Alternatively, this can be done as shown below:
#addmargins(prop.table(table(high_use = alc$high_use, prediction = alc$predict_high_use)))
#table(high_use = alc$high_use, prediction = alc$predict_high_use) %>% prop.table() %>% addmargins()
mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2513089
My model has a slightly lower mean prediction error (0.25) than the corresponding DataCamp model (0.26).
Below is an alternative way, which first defines a loss function that calculates the mean prediction error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2513089
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = high_use_mod2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801
This is the link to my data wrangling.
#Get access to the libraries
library(MASS)
library(ggplot2)
library(GGally)
library(tidyr)
#install.packages("plotly")
library(tidyverse)
library(corrplot)
library(dplyr)
library(plotly)
data("Boston")
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
#structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#dimension of the data
dim(Boston)
## [1] 506 14
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The dataset contains information on housing values in the suburbs of Boston, with 506 observations of 14 variables. The variables are described below:
| Variable | Definition |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
#graphical overview of the data
#ggpairs(Boston)
#pairs(Boston)
#plot(Boston)
#summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")
From the above, we can see both positive and negative correlations. For instance, industrial land use ("indus") is positively correlated with nitrogen oxide concentration ("nox") and also with property tax ("tax"), and crime ("crim") is positively correlated with the index of accessibility to radial highways ("rad"). "age" and "dis" are strongly negatively correlated, and "dis" and "tax" are moderately negatively correlated. The correlation plot also reveals other weak to strong correlations, in both directions, between the remaining variables.
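To complement the plot, the strongest pairs can also be pulled out of the correlation matrix programmatically. A minimal sketch reusing the cor_matrix computed above (the 0.7 cutoff is an arbitrary choice for "strong"):

```r
# find variable pairs with |correlation| above 0.7 (upper triangle only,
# so each pair is listed once and the diagonal is excluded)
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           cor  = round(cor_matrix[high_cor], 2))
```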
Linear discriminant analysis will be performed later, so the entire dataset needs to be scaled. This is done by subtracting the column mean from each column and dividing the difference by the column's standard deviation.
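As a sanity check, the standardisation formula can be applied by hand to one column and compared with what scale() produces. A minimal sketch, assuming the Boston data is loaded:

```r
# (x - mean(x)) / sd(x), applied manually to the crime rate column
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
# should agree with the corresponding column of scale(Boston)
all.equal(crim_manual, as.vector(scale(Boston$crim)))
```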
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled<-as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
To avoid confusion, I will replace the old continuous variable "crim" with the newly created categorical variable "crime", removing the original "crim" from the dataset.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
#number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
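A side note: sample() draws randomly, so re-knitting the document gives a different split. Fixing the RNG seed makes the split reproducible. A minimal sketch with a hypothetical seed, using throwaway names so the split above is untouched:

```r
set.seed(2017)                     # hypothetical seed for reproducibility
ind2   <- sample(n, size = n * 0.8)
train2 <- boston_scaled[ind2, ]    # same 80/20 split, now reproducible
test2  <- boston_scaled[-ind2, ]
```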
Here "crime" is used as the target variable and all other variables as predictors. Linear discriminant analysis (LDA), a generalization of Fisher's linear discriminant, is a method for finding a linear combination of features that separates or characterises two or more classes of objects or events. It is used in pattern recognition and in machine and statistical learning. See more here.
Put differently, it is a classification (and dimension reduction) method which finds the linear combination of variables that best separates the target variable classes. The target can be a binary or a multiclass variable.
Linear discriminant analysis is akin to many other methods, such as principal component analysis. LDA can be visualized with a biplot.
lda.fit <- lda(crime~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2524752 0.2500000 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 1.04676222 -0.9362056 -0.15765625 -0.8699240 0.45258884
## med_low -0.08831467 -0.3086239 -0.07933396 -0.5867408 -0.14193024
## med_high -0.36667477 0.1352662 0.11748284 0.3544023 0.06136783
## high -0.48724019 1.0171960 -0.11163110 1.0114364 -0.47426429
## age dis rad tax ptratio
## low -0.8626519 0.9111734 -0.6774717 -0.7275875 -0.45257741
## med_low -0.4017297 0.4004026 -0.5517590 -0.4723571 -0.08406871
## med_high 0.3699339 -0.3595801 -0.4360651 -0.3377904 -0.23876801
## high 0.7933259 -0.8324565 1.6373367 1.5134896 0.77985517
## black lstat medv
## low 0.37907839 -0.77703973 0.521397711
## med_low 0.31390409 -0.14208418 0.001976064
## med_high 0.09971637 0.01740126 0.106024077
## high -0.87654249 0.91644439 -0.714749185
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.04425537 0.71690251 -1.01073237
## indus 0.07469924 -0.24814827 0.24046132
## chas -0.11568506 -0.09790813 0.06069535
## nox 0.34550071 -0.74278580 -1.36730915
## rm -0.11886924 -0.07063936 -0.15241233
## age 0.17718853 -0.25554093 -0.23482273
## dis -0.02324915 -0.22809853 0.26504226
## rad 3.37764352 0.96058008 -0.13851323
## tax 0.01817335 -0.02529717 0.73262419
## ptratio 0.07699443 0.01626260 -0.33418657
## black -0.12290599 0.01334185 0.09368523
## lstat 0.21789588 -0.22647709 0.47802809
## medv 0.18441432 -0.34913943 -0.03510619
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9496 0.0361 0.0143
LDA uses the trained model to calculate the probability of each observation belonging to the various classes, and then assigns each observation to the most probable class.
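As a quick illustration of those class probabilities, predict() on the fitted model returns the posteriors alongside the predicted class. A sketch, assuming lda.fit and the test set from above:

```r
pred <- predict(lda.fit, newdata = test)
# posterior class probabilities: one row per observation, rows sum to 1
head(round(pred$posterior, 3))
# the class with the highest posterior probability for each observation
head(pred$class)
```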
To visualise the result, the helper function from the DataCamp exercise will be used. It was originally derived from a Stack Overflow comment here.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# convert the crime classes to numeric for use as plot colours and symbols
train$crime <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = train$crime, pch= train$crime)
lda.arrows(lda.fit, myscale = 2)
The above arrows depict the relationship between the original variables and the LDA solution.
correct_classes <- test[,"crime"]
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes , predicted = lda.pred$class )
## predicted
## correct low med_low med_high high
## low 10 12 2 0
## med_low 3 14 7 0
## med_high 0 3 20 2
## high 0 0 0 29
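From the confusion table, the share of correct predictions is the sum of the diagonal over the total number of test observations. A minimal sketch reusing the objects above:

```r
conf <- table(correct = correct_classes, predicted = lda.pred$class)
# proportion of correctly classified observations
sum(diag(conf)) / sum(conf)
```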
# load MASS and Boston
library(MASS)
data('Boston')
boston_standard <- scale(Boston)
dist_eu <-dist(boston_standard, method= "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(boston_standard, method="manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
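As a sanity check on the distance measures, the Euclidean distance between the first two scaled observations can be computed straight from the definition and compared with the corresponding entry of dist_eu:

```r
# square root of the summed squared differences between rows 1 and 2
sqrt(sum((boston_standard[1, ] - boston_standard[2, ])^2))
# the same value, read from the distance object
as.matrix(dist_eu)[1, 2]
```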
K-means is a popular clustering method. It is an unsupervised method that assigns observations to groups (clusters) based on the similarity of the objects. The distance matrix is computed automatically by the kmeans() function (source). Next, run the k-means algorithm on the dataset.
km <-kmeans(Boston, centers = 3)
pairs(Boston[6:8], col = km$cluster)
K-means needs the number of clusters as an argument. There are many ways to look at the optimal number of clusters and a good way might depend on the data you have.
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function set.seed() can be used to deal with that. source
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the plot, we can see that the total WCSS drops drastically at about 2 clusters, so the number of centers will be set to 2 for the k-means clustering.
km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
From the above, it is hard to see the variables, so I'll select a subset of them for plotting.
pairs(Boston[4:7], col = km$cluster)
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
boston_standard2<-scale(Boston)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(boston_standard2, centers = 7)
#convert km to dataframe
boston_standard2<-as.data.frame(boston_standard2)
lda.fit_clus<- lda(km$cluster~., data=boston_standard2)
# plot the Boston dataset with clusters
pairs(boston_standard2[,3:7], col = km$cluster)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes: the k-means cluster assignments (one per observation)
classes <- km$cluster
# plot the lda results with biplot arrows from the cluster model
plot(lda.fit_clus, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit_clus, myscale = 2)
Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)
I got a better visualisation by using the crime classes as the colour representation.
# km$cluster covers all 506 rows, so subset it with ind to match the 404 training rows
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = km$cluster[ind])